Model-Based Recognition and Parameter Estimation of Buildings from Multi-View Aerial Imagery Using Multi-Segmentation
This paper describes a system for the analysis of aerial images of urban areas using multiple images from different viewpoints. The emphasis is on the experimental evaluation using segmented images obtained by applying three different parameter settings in the segmentation process. The proposed approach combines bottom-up and top-down processing. To statistically evaluate the performance of the system, a set of 50 realisations of 5 images from different viewpoints was used, generated by combining real and ray-traced images. The experiments show a significant improvement in reliability and accuracy if multi-segmentation is used in multi-view imagery instead of single segmentation.
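The statistical evaluation described above can be sketched as follows: a per-realisation error is summarized by its mean and spread, and the two segmentation strategies are compared on those summaries. All numbers below are made up for illustration; this is not the authors' evaluation code.

```python
# Hypothetical sketch of evaluating reconstruction error over many
# realisations, comparing single- vs multi-segmentation. Numbers are invented.
from statistics import mean, stdev

def summarize(errors):
    """Mean and standard deviation of per-realisation errors."""
    return mean(errors), stdev(errors)

single_seg = [1.9, 2.4, 2.1, 2.6, 2.0]  # error per realisation, single segmentation
multi_seg = [1.2, 1.4, 1.3, 1.5, 1.1]   # error per realisation, multi-segmentation

print(summarize(single_seg))
print(summarize(multi_seg))  # lower mean and spread: more accurate and more reliable
```

A lower mean indicates better accuracy, and a lower standard deviation over realisations indicates better reliability, which is the comparison the abstract reports.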
Are current long-term video understanding datasets long-term?
Many real-world applications, from sport analysis to surveillance, benefit
from automatic long-term action recognition. In the current deep learning
paradigm for automatic action recognition, it is imperative that models are
trained and tested on datasets and tasks that evaluate if such models actually
learn and reason over long-term information. In this work, we propose a method
to evaluate how suitable a video dataset is to evaluate models for long-term
action recognition. To this end, we define a long-term action task as one that excludes all
videos that can be correctly recognized using solely short-term
information. We test this definition on existing long-term classification tasks
on three popular real-world datasets, namely Breakfast, CrossTask and LVU, to
determine if these datasets are truly evaluating long-term recognition. Our
study reveals that these datasets can be effectively solved using shortcuts
based on short-term information. Following this finding, we encourage long-term
action recognition researchers to make use of datasets that need long-term
information to be solved.
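The proposed definition amounts to a filtering step: keep only the videos that a short-term model misclassifies. A minimal sketch, where `short_term_predict` stands in for any model that only sees a short clip:

```python
# Sketch of the paper's filtering idea: a dataset only tests long-term
# recognition on the videos a short-term model gets wrong.
def filter_long_term(videos, labels, short_term_predict):
    """Keep only (video, label) pairs NOT recoverable from short-term cues."""
    kept = []
    for video, label in zip(videos, labels):
        if short_term_predict(video) != label:  # short-term shortcut fails here
            kept.append((video, label))
    return kept

# Toy usage: a shortcut model that always predicts "pour".
videos = ["clip_a", "clip_b", "clip_c"]
labels = ["pour", "assemble", "pour"]
remaining = filter_long_term(videos, labels, lambda v: "pour")
print(remaining)  # only the video the shortcut misclassifies survives
```

If little of a dataset survives this filter, the dataset can be solved with short-term shortcuts, which is the study's finding for Breakfast, CrossTask and LVU.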
Video BagNet: short temporal receptive fields increase robustness in long-term action recognition
Previous work on long-term video action recognition relies on deep
3D-convolutional models that have a large temporal receptive field (RF). We
argue that these models are not always the best choice for temporal modeling in
videos. A large temporal receptive field allows the model to encode the exact
sub-action order of a video, which causes a performance decrease when testing
videos have a different sub-action order. In this work, we investigate whether
we can improve the model robustness to the sub-action order by shrinking the
temporal receptive field of action recognition models. For this, we design
Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive
field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on
synthetic and real-world video datasets and experimentally compare models with
varying temporal receptive fields. We find that short receptive fields are
robust to sub-action order changes, while larger temporal receptive fields are
sensitive to the sub-action order.
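How the temporal receptive field grows with depth, and why collapsing temporal kernel sizes caps it, can be illustrated with the standard receptive-field recurrence. This is a generic sketch, not the Video BagNet code; the layer configurations are invented.

```python
# Illustrative sketch: temporal receptive field of stacked (3D) convolutions,
# using the standard recurrence rf += (k - 1) * jump, jump *= stride.
def temporal_rf(layers):
    """layers: list of (temporal_kernel, temporal_stride) pairs."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Sixteen layers with temporal kernel 3 everywhere: the RF keeps growing.
print(temporal_rf([(3, 1)] * 16))             # 33 frames
# Same depth, but temporal kernels collapsed to 1 after the first layer:
print(temporal_rf([(3, 1)] + [(1, 1)] * 15))  # stays at 3 frames
```

Limiting the kernels that look across time is one way to bound the temporal receptive field regardless of depth, which is the design knob the Video BagNet variants (RF of 1, 9, 17 or 33 frames) turn.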
Linear color correction for multiple illumination changes and non-overlapping cameras
Many image processing methods, such as techniques for people re-identification, assume photometric constancy between different images. This study addresses the correction of photometric variations, using changes in background areas to correct foreground areas. The authors assume a multiple-light-source model in which all light sources can have different colours and can change over time. In training mode, the authors learn per-location relations between foreground and background colour intensities. In correction mode, the authors apply a double linear correction model based on the learned relations. This double linear correction includes a dynamic local illumination correction mapping as well as an inter-camera mapping. The authors evaluate their illumination correction by computing the similarity between two images based on the earth mover's distance. The authors compare the results to a representative auto-exposure algorithm from the recent literature and to a colour correction algorithm based on inverse-intensity chromaticity. Especially in complex scenarios, the authors' method outperforms these state-of-the-art algorithms.
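One half of such a double linear correction can be sketched as a per-channel least-squares fit on background pixels, then applied to foreground pixels. This is a minimal illustration on made-up data, not the authors' implementation (which also learns per-location foreground-background relations and an inter-camera mapping).

```python
import numpy as np

# Hedged sketch: fit a per-channel linear map a*x + b on background pixels
# observed under two illuminations, then apply it to foreground pixels.
def fit_linear_map(bg_src, bg_dst):
    """Least-squares (a, b) per channel so that a*bg_src + b ~= bg_dst."""
    maps = []
    for c in range(bg_src.shape[1]):
        A = np.stack([bg_src[:, c], np.ones(len(bg_src))], axis=1)
        a, b = np.linalg.lstsq(A, bg_dst[:, c], rcond=None)[0]
        maps.append((a, b))
    return maps

def apply_map(pixels, maps):
    out = pixels.astype(float).copy()
    for c, (a, b) in enumerate(maps):
        out[:, c] = a * pixels[:, c] + b
    return out

# Toy check: the second illumination dims by 0.5 and adds an offset of 10.
bg_src = np.array([[40.0, 60.0, 80.0], [120.0, 140.0, 160.0]])
bg_dst = 0.5 * bg_src + 10.0
maps = fit_linear_map(bg_src, bg_dst)
fg = np.array([[100.0, 100.0, 100.0]])
print(apply_map(fg, maps))  # close to [[60, 60, 60]]
```

In this toy case the fit exactly recovers the simulated gain and offset, so the foreground is mapped consistently with the background change.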
Incremental concept learning with few training examples and hierarchical classification
Object recognition and localization are important to automatically interpret video and allow better querying
on its content. We propose a method for object localization that learns incrementally and addresses four key
aspects. Firstly, we show that for certain applications, recognition is feasible with only a few training samples.
Secondly, we show that novel objects can be added incrementally without retraining existing objects, which is
important for fast interaction. Thirdly, we show that an unbalanced number of positive training samples leads
to biased classifier scores that can be corrected by modifying weights. Fourthly, we show that the detector
performance can deteriorate due to hard-negative mining for similar or closely related classes (e.g., for Barbie
and dress, because the doll is wearing a dress). This can be solved by our hierarchical classification. We introduce
a new dataset, which we call TOSO, and use it to demonstrate the effectiveness of the proposed method for the
localization and recognition of multiple objects in images.
This research was performed in the GOOSE project, which is jointly funded by the enabling technology program
Adaptive Multi Sensor Networks (AMSN) and the MIST research program of the Dutch Ministry of Defense.
This publication was supported by the research program Making Sense of Big Data (MSoBD). Peer-reviewed.
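The score-bias issue described above can be illustrated with a simple calibration: with unbalanced positives, per-class scores live on different scales, so one fixed threshold is unfair across classes. The correction below (rescaling each class so its mean held-out positive score is 1.0) is an assumed illustration of the idea, not the paper's exact weight modification; class names echo the Barbie/dress example.

```python
# Hedged sketch: per-class score calibration to undo scale bias from an
# unbalanced number of positive training samples.
def calibrate(scores_per_class):
    """scores_per_class: {class: raw scores on held-out positives} -> factors."""
    return {c: 1.0 / (sum(s) / len(s)) for c, s in scores_per_class.items()}

def corrected_score(raw, cls, factors):
    return raw * factors[cls]

factors = calibrate({"barbie": [2.0, 4.0], "dress": [0.2, 0.6]})
# After correction, typical positives of both classes score around 1.0,
# so a single detection threshold treats the classes comparably.
print(corrected_score(3.0, "barbie", factors), corrected_score(0.4, "dress", factors))
```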
Interactive detection of incrementally learned concepts in images with ranking and semantic query interpretation
This research was performed in the GOOSE project, which is jointly funded by the MIST research program of the Dutch Ministry of Defense and the AMSN enabling technology program. The number of networked cameras is growing
exponentially. Multiple applications in different domains result
in an increasing need to search semantically over video sensor
data. In this paper, we present the GOOSE demonstrator, which
is a real-time general-purpose search engine that allows users to
pose natural language queries to retrieve corresponding images.
Top-down, this demonstrator interprets queries, which are
presented as an intuitive graph to collect user feedback. Bottom-up,
the system automatically recognizes and localizes concepts in
images and it can incrementally learn novel concepts. A smart
ranking combines both and allows effective retrieval of relevant
images. Peer-reviewed.
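The combination of top-down query interpretation and bottom-up detection can be sketched as a ranking over per-image concept scores. This is an illustrative assumption about the combination step, not the GOOSE implementation; images, concepts and scores are invented.

```python
# Illustrative sketch: rank images by the summed detector scores of the
# concepts extracted (top-down) from the user's query.
def rank_images(query_concepts, detections):
    """detections: {image: {concept: score}} -> images, best match first."""
    scored = []
    for image, scores in detections.items():
        total = sum(scores.get(c, 0.0) for c in query_concepts)
        scored.append((total, image))
    return [image for total, image in sorted(scored, reverse=True)]

detections = {
    "img1": {"car": 0.9, "person": 0.2},
    "img2": {"car": 0.1, "person": 0.8},
    "img3": {"car": 0.7, "person": 0.6},
}
print(rank_images(["car", "person"], detections))  # img3 ranks first
```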
TNO at TRECVID 2013 : multimedia event detection and instance search
We describe the TNO system and the evaluation results for the TRECVID 2013 Multimedia Event Detection (MED) and instance search (INS) tasks. The MED system consists of a bag-of-words (BOW) approach with spatial tiling that uses low-level static and dynamic visual features, an audio feature and high-level concepts. Automatic speech recognition (ASR) and optical character recognition (OCR) are not used in the system. In the MED case with 100 example training videos, support-vector machines (SVM) are trained and fused to detect an event in the test set. In the case with 0 example videos, positive and negative concepts are extracted as keywords from the textual event description and events are detected with the high-level concepts. The MED results show that the SIFT keypoint descriptor is the one that contributes best to the results, that fusion of multiple low-level features helps to improve the performance, and that the textual event-description chain currently performs poorly. The TNO INS system presents a baseline open-source approach using standard SIFT keypoint detection and exhaustive matching. In order to speed up search times for queries, a basic map-reduce scheme is presented to be used on a multi-node cluster. Our INS results show above-median results with acceptable search times. This research for the MED submission was performed in the GOOSE project, which is jointly funded by the enabling technology program Adaptive Multi Sensor Networks (AMSN) and the MIST research program of the Dutch Ministry of Defense. The INS submission was partly supported by the MIME project of the creative industries knowledge and innovation network CLICKNL. Peer-reviewed.
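The fusion step mentioned above, combining per-feature SVM scores into one event detector, can be sketched as a weighted average. The feature names and scores below are illustrative assumptions, not TNO's actual channels or fusion weights.

```python
# Hedged sketch of late fusion: each low-level feature yields its own SVM
# score per video; the fused detector is their weighted average.
def fuse_scores(per_feature_scores, weights=None):
    """per_feature_scores: {feature: {video: score}} -> {video: fused score}."""
    features = list(per_feature_scores)
    weights = weights or {f: 1.0 / len(features) for f in features}
    videos = next(iter(per_feature_scores.values())).keys()
    return {v: sum(weights[f] * per_feature_scores[f][v] for f in features)
            for v in videos}

scores = {
    "sift_bow": {"vid1": 0.9, "vid2": 0.3},
    "audio":    {"vid1": 0.5, "vid2": 0.7},
}
print(fuse_scores(scores))  # vid1 fused ~0.7, vid2 fused ~0.5
```

Uniform weights are the simplest choice; per-feature weights could instead be tuned on a validation set, which is where a strong channel such as SIFT would receive more influence.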